Regular Expressions

Regular Expressions are a powerful way to define patterns for searching and matching.  Beyond Compare allows you to use regular expressions when searching through text, and when specifying rules for classifying text.  The regular expression support in Beyond Compare is a subset of the Perl Compatible Regular Expression (PCRE) syntax.

While Regular Expressions can be a complex topic, there are several excellent resources about them.  One such resource is a book called Mastering Regular Expressions.  Another excellent resource is Steve Mansour's A Tao of Regular Expressions, a copy of which can be found at:

    jmason.org/software/sitescooper/tao_regexps.html

A regular expression is composed of two types of characters:  normal characters and metacharacters.  When performing a match, metacharacters take on special meanings, controlling how the match is made and serving as wildcards.  Normal characters match only themselves.  To match a metacharacter literally, escape it by prefixing it with a backslash "\".  There are multiple types of metacharacters, each detailed below.
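The difference between a metacharacter and its escaped form can be sketched with Python's re module.  This is an assumption made for illustration:  Python's engine is not the one Beyond Compare uses, although it treats "." and "\." the same way.  The re.escape helper is specific to Python.

```python
import re

# Python's re module stands in for Beyond Compare's engine here (an assumption).

# Unescaped, "." is a metacharacter and matches any single character.
unescaped_hits = re.search(r"3.14", "3x14") is not None      # True

# Escaped with a backslash, it matches only a literal period.
escaped_miss = re.search(r"3\.14", "3x14") is None           # True
escaped_hit = re.search(r"3\.14", "pi is 3.14") is not None  # True

# re.escape() (Python-specific) adds the backslashes for you.
literal = re.escape("a+b")
literal_hit = re.search(literal, "a+b=c") is not None        # True
```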

Metacharacters - Escape Sequences

    Escape Sequence    Meaning
    \xnn               character with the hex code nn
    \x{nnnn}           character with the hex code nnnn
    \x{F000}           character with a null value
    \t                 tab (0x09)
    \f                 form feed (0x0C)
    \a                 bell (0x07)
    \e                 ESC (0x1B)
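A few of these escape sequences can be tried with Python's re module, used here only as a stand-in for Beyond Compare's engine.  Python accepts \xnn, \t, \f and \a, but it does not accept the \x{nnnn} or \e forms, so those are left out of this sketch.

```python
import re

# Python's re module stands in for Beyond Compare's engine here (an assumption).

# \x41 is the character with hex code 41, i.e. "A".
hex_match = re.search(r"\x41", "BAT")

# \t matches a tab character (0x09).
tab_match = re.search(r"a\tb", "a\tb")

# \f and \a match form feed (0x0C) and bell (0x07).
ff_match = re.search(r"\f", "page1\x0cpage2")
bell_match = re.search(r"\a", "ding\x07")
```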

Metacharacters - Predefined classes

Predefined character classes match any of a certain subset of characters.  The following classes are already defined for you.

    Class    Meaning
    .        match any character
    \w       any alphanumeric character or _
    \W       any non-alphanumeric character
    \d       any numeric character (0-9)
    \D       any non-numeric character
    \s       any whitespace (space, tab)
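The predefined classes can be illustrated with a short Python sketch.  Python's re module is an assumption made for demonstration, not Beyond Compare's engine;  in particular, Python's \w, \d and \s also match non-ASCII letters, digits and whitespace, which the ASCII sample text below avoids.

```python
import re

# Python's re module stands in for Beyond Compare's engine here (an assumption).
text = "Rev_2 built 2024-01-15\tOK"

words = re.findall(r"\w+", text)            # ['Rev_2', 'built', '2024', '01', '15', 'OK']
digits = re.findall(r"\d+", text)           # ['2', '2024', '01', '15']
non_digits = re.findall(r"\D+", "a1b22c")   # ['a', 'b', 'c']
spaces = re.findall(r"\s", text)            # [' ', ' ', '\t']
any_char = re.findall(r"b.t", "bat bit b t")  # ['bat', 'bit', 'b t']
```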

You can also construct your own character classes by surrounding a group of characters with brackets "[]".  The predefined classes (except ".") can be used in the brackets, and if a dash "-" appears between two characters, it represents a range.  Thus [a-z] would represent all lowercase letters, and [a-zA-Z] would represent both lowercase and uppercase letters.  To include a "-" as part of the class, place it at the beginning or end of the class.

If the first character within the brackets is a caret "^", then the class represents everything except the specified characters.  For example, [^a-z] matches any character that isn't a lowercase letter.
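Custom classes, ranges, a leading dash and a negated class can be sketched in Python.  The re module is assumed here as a stand-in for Beyond Compare's engine, and the sample strings are made up for illustration.

```python
import re

# Python's re module stands in for Beyond Compare's engine here (an assumption).

lower = re.findall(r"[a-z]+", "abc DEF ghi")           # ['abc', 'ghi']
letters = re.findall(r"[a-zA-Z]+", "abc DEF 123")      # ['abc', 'DEF']

# A leading "^" negates the class.
not_lower = re.findall(r"[^a-z]+", "abcDEF123ghi")     # ['DEF123']

# A "-" placed first in the class is a literal dash, not a range.
with_dash = re.findall(r"[-a-z]+", "well-known FACT")  # ['well-known']

# A predefined class inside brackets; "." is a literal period there.
version = re.findall(r"[\d.]+", "v1.25 rc")            # ['1.25']
```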

Metacharacters - Alternatives

By placing a "|" between two groups of items, alternative matches can be represented.  a|b will match either "a" or "b".  ab|cd will match "ab" or "cd", but not "ac".  Each alternative extends from the nearest pattern delimiter ("(", "[", or the start of the pattern) up to the "|", and from the "|" to the next delimiter or the end of the pattern.  Alternatives can be placed within parentheses "()" to make it obvious what is being matched, as in a(bc|de)f.  Alternatives are tried left to right;  bey|beyond will match "bey", even if the string is "beyond".
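The left-to-right behavior is easy to check in Python.  The re module is an assumption used for demonstration rather than Beyond Compare's own engine.  The last line shows that listing the longer alternative first changes the result.

```python
import re

# Python's re module stands in for Beyond Compare's engine here (an assumption).
# fullmatch() requires the whole string to match the pattern.

either = [bool(re.fullmatch(r"ab|cd", s)) for s in ("ab", "cd", "ac")]
# [True, True, False]

grouped = [bool(re.fullmatch(r"a(bc|de)f", s)) for s in ("abcf", "adef", "abdf")]
# [True, True, False]

# Alternatives are tried left to right, so the first one that fits wins.
first_wins = re.search(r"bey|beyond", "beyond").group()    # 'bey'
longer_first = re.search(r"beyond|bey", "beyond").group()  # 'beyond'
```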

Metacharacters - Position

The following metacharacters control where the match can occur on a line.  Note: \A and \Z match the start and end of text respectively, but since Beyond Compare performs the search on a line by line basis, these have the same effect as ^ and $.

    Metacharacter    Meaning
    ^                match only at start of line
    $                match only at end of line
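A line-by-line search with position metacharacters might look like the following Python sketch.  The list of lines and the use of Python's re module are assumptions for illustration;  each string is searched separately to mimic a per-line search.

```python
import re

# Python's re module stands in for Beyond Compare's engine here (an assumption).
lines = ["error: disk full", "no error here", "fatal error"]

# "^" anchors the match to the start of the line.
starts_with = [ln for ln in lines if re.search(r"^error", ln)]
# ['error: disk full']

# "$" anchors the match to the end of the line.
ends_with = [ln for ln in lines if re.search(r"error$", ln)]
# ['fatal error']

# Both together require the pattern to span the entire line.
whole_line = [ln for ln in lines if re.search(r"^fatal error$", ln)]
# ['fatal error']
```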

Metacharacters - Iterators

Anything in a regular expression can be followed by an iterator metacharacter, which refers to the item before it.  There are two kinds of iterators:  greedy and non-greedy.  Greedy iterators match as many times as they can;  non-greedy iterators match as few times as they can.

Greedy:

    Metacharacter    Meaning
    *                matches zero or more times (equivalent to {0,})
    +                matches one or more times (equivalent to {1,})
    ?                matches zero or one time (equivalent to {0,1})
    {n}              matches exactly n times (equivalent to {n,n})
    {n,}             matches n or more times
    {n,m}            matches at least n but no more than m times

Non-greedy:

    Metacharacter    Meaning
    *?               matches zero or more times
    +?               matches one or more times
    ??               matches zero or one time
    {n}?             matches exactly n times
    {n,}?            matches at least n times
    {n,m}?           matches at least n but no more than m times
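The difference between greedy and non-greedy iterators shows up when more than one match length is possible.  The sketch below uses Python's re module as an assumed stand-in for Beyond Compare's engine, with sample strings chosen for illustration.

```python
import re

# Python's re module stands in for Beyond Compare's engine here (an assumption).
text = "<b>bold</b> and <i>italic</i>"

# Greedy "+" takes as much as it can: from the first "<" to the last ">".
greedy = re.search(r"<.+>", text).group()   # '<b>bold</b> and <i>italic</i>'

# Non-greedy "+?" stops at the first ">" that lets the match succeed.
lazy = re.search(r"<.+?>", text).group()    # '<b>'

# The same applies to counted iterators.
counted_greedy = re.search(r"a{2,4}", "aaaaa").group()   # 'aaaa'
counted_lazy = re.search(r"a{2,4}?", "aaaaa").group()    # 'aa'

# "?" makes the preceding item optional.
optional = [bool(re.fullmatch(r"colou?r", s)) for s in ("color", "colour", "colouur")]
# [True, True, False]
```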

Metacharacters - Subexpressions

Parentheses "()" can also be used to group characters for use with iterators and backreferences (discussed below).  (bey){4,5} will match 4 or 5 consecutive instances of "bey".  (abc|[0-9])* will match any combination of "abc" and the digits 0 to 9, for example "abc5", "679abc" and "abc77abc".
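Both patterns can be checked in Python, assuming its re module as a stand-in for Beyond Compare's engine.  fullmatch is used so that the whole string has to fit the pattern;  a plain search would also succeed on longer strings that merely contain a match.

```python
import re

# Python's re module stands in for Beyond Compare's engine here (an assumption).

# The iterator {4,5} applies to the whole group "bey".
repeated = [bool(re.fullmatch(r"(bey){4,5}", "bey" * n)) for n in (3, 4, 5, 6)]
# [False, True, True, False]

# "*" repeats the group, and each repetition may pick either alternative.
mixed = [bool(re.fullmatch(r"(abc|[0-9])*", s))
         for s in ("abc5", "679abc", "abc77abc", "ab5")]
# [True, True, True, False]
```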

Metacharacters - Back References

Each sequence of characters matched within "()" is saved as a subexpression, which you can refer to later with \1 to \9;  the subexpressions are numbered from left to right.  b(.)\1n will match "been" and "boon", but not "bean", "ben" or "beeen".
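The backreference example can be reproduced in Python.  The re module is an assumption made for illustration, not Beyond Compare's engine, but it handles \1 the same way for this pattern.

```python
import re

# Python's re module stands in for Beyond Compare's engine here (an assumption).
pattern = r"b(.)\1n"   # \1 must repeat whatever the first group matched

candidates = ("been", "boon", "bean", "ben", "beeen")
results = {s: bool(re.fullmatch(pattern, s)) for s in candidates}
# {'been': True, 'boon': True, 'bean': False, 'ben': False, 'beeen': False}

# The saved subexpression is available after the match.
captured = re.fullmatch(pattern, "boon").group(1)   # 'o'
```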

Modifiers

Modifiers allow changes to the matching behavior from that point on.  If the modifier is contained within a subexpression, it affects only that subexpression.  Use (?i) and (?-i) to control the case sensitivity of matching.

Examples:

    (?i)Beyond Compare         matches both "Beyond Compare" and "beyond compare"
    (?i)Beyond (?-i)Compare    matches "Beyond Compare" and "bEyOnD Compare", but not "beyond compare"
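A rough Python equivalent of these two examples is sketched below.  Python's re module is an assumed stand-in and differs from the syntax above in one respect:  it does not accept a bare (?-i) in the middle of a pattern, so the second example is rewritten with Python's scoped form (?i:...), which limits case-insensitive matching to the group.

```python
import re

# Python's re module stands in for Beyond Compare's engine here (an assumption).

# (?i) at the start makes the whole pattern case-insensitive.
whole = [bool(re.fullmatch(r"(?i)Beyond Compare", s))
         for s in ("Beyond Compare", "beyond compare")]
# [True, True]

# Python-specific rewrite of "(?i)Beyond (?-i)Compare":
# only "Beyond" ignores case; " Compare" must match exactly.
partial = [bool(re.fullmatch(r"(?i:Beyond) Compare", s))
           for s in ("Beyond Compare", "bEyOnD Compare", "beyond compare")]
# [True, True, False]
```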

See also

Regular Expressions Rename

Sample Regular Expressions